W1: Fundamentals

Welcome!

Please sign up for Google Classroom (link) if you haven’t already.

Introductions

  • Who am I?
  • What is DaSL?
  • Who are you?

    • Name, pronouns, group you work in

    • What you want to get out of the class

    • Something that keeps you going in the winter!

Goals of the course

  • Continue building programming fundamentals: how to use complex data structures, write your own functions, and iterate over repeated tasks.
  • Continue exploring data science fundamentals: how to use the programming fundamentals above to clean messy data into a Tidy form for analysis.

A Motivating Example

Week 1 Classwork (in Classroom)

Tentative Schedule

Week      Date    Topic
1*        Jan 21  Fundamentals
2         Jan 28  Iteration (for loops)
3*        Feb 4   Conditionals
4         Feb 11  Functions
No Class  Feb 18  Break week
5         Feb 25  Iteration Styles
6*        Mar 4   Reference vs. Copy / Last Day of Class

* = Ted on Campus for class

More About the Schedule

All classes are on Wednesdays from 12:00-1:30 PM PST, either online or in Arnold M1-B406 (the Data Science Lounge). Connection details will be provided. Office hours for each class day are posted below, and an invite will be sent to you.

In class we will be going through the notebooks hosted on Google Classroom.

Classes will be recorded, and those recordings will be sent to you after each class.

Full course page here: https://hutchdatascience.org/Intermediate_Python/

Format of the course

  • Hybrid, and recordings will be available.
  • 1-2 hour exercises after each session are encouraged for practice; time at beginning of class provided to work on exercises.
  • Office Hours Fridays 10am - 11am PT.

Culture of the course

  • Challenge: We are learning a new language, but you already have a full-time job
  • Teach not for mastery, but teach for empowerment to learn effectively.

Culture of the course

  • Challenge: We sometimes struggle with our data science in isolation, unaware that someone two doors down from us has gone through the same struggle.
  • We learn and work better with our peers.
  • Know that if you have a question, other people will have it.

We ask you to follow Participation Guidelines and Code of Conduct.

Ready?

Data types

Data type name  Data type shorthand  Examples
Integer         int                  2, 4
Float           float                3.5, -34.1009
String          str                  "hello", "234-234-8594"
Boolean         bool                 True, False

There is also a special value, None (of data type NoneType), that shows up when an expression or function call returns nothing.
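A quick way to check a value's data type is the built-in type() function. The sketch below uses made-up example values, and also shows None appearing when a function call returns nothing.

```python
# Example values chosen for illustration only.
examples = [2, 3.5, "hello", True, None]

for value in examples:
    # type(value).__name__ gives the shorthand name, e.g. 'int'
    print(repr(value), "->", type(value).__name__)

type_names = [type(value).__name__ for value in examples]

# print() displays its argument but returns nothing, so the result is None.
result = print("hi")
print(result is None)  # True
```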

Data structures

  • List

  • Dataframe

  • Dictionary

  • Tuple

Objects in Python

What does it contain?

  • Value that holds the essential data for the object.
  • Attributes that hold a subset of the data or additional data for the object.

What can it do?

  • Methods: functions that are specific to the data type and automatically take the object as input.

This organizing structure on an object applies to pretty much all Python data types and data structures.

Lists as an Object

What does it contain?

  • Value: the contents of the list, such as [2, 3, 4].
  • Attributes: subsetting via [ ].

What can it do (methods)?

  • my_list.append(4) adds 4 to the end of my_list. It modifies the list in place and returns None.

What’s the difference between a method and a function?
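One way to see the difference, as a minimal sketch with an illustrative my_list: a function such as len() is called on its own and takes the object as an argument, while a method such as .append() is called from the object with dot notation and receives that object automatically.

```python
my_list = [2, 3, 4]

# Function: the object is passed in explicitly as an argument.
n = len(my_list)
print(n)  # 3

# Method: called from the object; my_list is the implicit input.
returned = my_list.append(4)

print(my_list)   # [2, 3, 4, 4]
print(returned)  # None, because .append() modifies the list in place
```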

Dataframe as an Object

What does it contain?

  • Value: the spreadsheet of data.

What can it do (methods)?

  • .head()

  • .tail()
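As a small sketch of the same idea, the Dataframe below (toy_df, a made-up example) exposes attributes such as .shape and .columns, which are accessed without parentheses, and methods such as .head() and .tail(), which are called with parentheses.

```python
import pandas as pd

# toy_df is a hypothetical Dataframe used only for illustration.
toy_df = pd.DataFrame({"id": ["AAA", "BBB", "CCC", "DDD"],
                       "score": [2.5, 3.5, 9.0, 0.1]})

# Attributes: no parentheses.
print(toy_df.shape)          # (4, 2): 4 rows, 2 columns
print(list(toy_df.columns))  # ['id', 'score']

# Methods: called with parentheses; the argument is the number of rows.
first_two = toy_df.head(2)
last_one = toy_df.tail(1)
print(first_two)
print(last_one)
```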

Break!

A quick survey to get us started: https://forms.gle/cYEWWhk1aagsefPA7

Dictionary

A dictionary is designed as a lookup table, organized in key-value pairs. You associate the key with a particular value, and use the key to find the value.

sentiment = {'happy': 8, 'sad': 2, 'joy': 7.5, 'embarrassed': 3.6, 'restless': 4.1, 'apathetic': 3.8, 'calm': 7}
sentiment
{'happy': 8,
 'sad': 2,
 'joy': 7.5,
 'embarrassed': 3.6,
 'restless': 4.1,
 'apathetic': 3.8,
 'calm': 7}

You use a key to find its corresponding value:

sentiment['joy']
7.5

You cannot use a numerical index to find values, like you did for Lists!

#sentiment[0]
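Uncommenting the line above raises a KeyError, because 0 is treated as a key rather than a position. A minimal sketch, using a shortened copy of the dictionary:

```python
sentiment = {'happy': 8, 'sad': 2, 'joy': 7.5}

try:
    sentiment[0]  # 0 is looked up as a key, not as a position
except KeyError as err:
    print("KeyError:", err)  # KeyError: 0

# If you do need the keys in insertion order, convert them to a List first.
keys = list(sentiment)
print(keys[0])  # happy
```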

Rules of Dictionaries

  • Only one value per key. No duplicate keys allowed.
  • Keys must be of an immutable (hashable) type, such as string, integer, float, boolean, or tuple.
  • Values can be of any type, including data structures such as lists and dictionaries.

If a key is repeated, only the last value assigned to it is kept:

duplicated_keys = {'Student' : 97, 'Student': 88, 'Student' : 91}
duplicated_keys
{'Student': 91}

A value can itself be a data structure, such as a List:

child = {"name" : "Emil", "year" : 2004, "likes": ["jumping", "skating", "laughing"]}
child["likes"][1]
'skating'
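To illustrate the rule about keys, here is a sketch with hypothetical data: a tuple works as a key, while a List does not, because Lists can be changed after creation and are therefore unhashable.

```python
# Hypothetical lookup table keyed by (row, column) tuples.
grid = {(0, 0): "start", (2, 3): "end"}
print(grid[(2, 3)])  # end

# A List cannot be used as a key.
try:
    bad = {[0, 0]: "start"}
except TypeError as err:
    print("TypeError:", err)
```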

Data stored in nested dictionaries are often represented as JSON files. Here’s a guide on using JSON files in Python.
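As a rough sketch, Python's built-in json module converts between nested dictionaries and JSON text. The example reuses the child dictionary from above; json.dumps() and json.loads() work on strings, while json.dump() and json.load() work on files.

```python
import json

child = {"name": "Emil", "year": 2004,
         "likes": ["jumping", "skating", "laughing"]}

# Dictionary -> JSON-formatted string
text = json.dumps(child, indent=2)
print(text)

# JSON-formatted string -> dictionary
restored = json.loads(text)
print(restored["likes"][1])  # skating
print(restored == child)     # True
```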

Using key to find values

sentiment['joy'] 
7.5
sentiment['joy'] = sentiment['joy'] + 1

If a key doesn’t exist, you will get an error:

#sentiment["dog"]

If you don’t want to run the risk of getting an error, you can specify a default value using the .get() method.

sentiment.get("dog", "not found")
'not found'
print(sentiment.get("dog"))
None
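Two related points, sketched with a shortened copy of the dictionary: the in operator checks whether a key exists, and .get() only looks a key up without adding it.

```python
sentiment = {'happy': 8, 'sad': 2, 'joy': 7.5}

# Check for a key before using it.
if "dog" in sentiment:
    print(sentiment["dog"])
else:
    print("no entry for dog")

# .get() returns the default but leaves the dictionary unchanged.
value = sentiment.get("dog", "not found")
print(value)               # not found
print("dog" in sentiment)  # False
```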

Adding new key-value pairs

You can add more key-value pairs by assigning them directly. If the key already exists, the value for that key will simply be updated.

sentiment['dog'] = 5
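A short sketch of both cases, again with a shortened copy of the dictionary: assigning to a new key adds a pair, and assigning to an existing key overwrites its value.

```python
sentiment = {'happy': 8, 'sad': 2}

sentiment['dog'] = 5       # new key: a pair is added
size_after_add = len(sentiment)
print(size_after_add)      # 3

sentiment['dog'] = 6       # existing key: the value is updated
print(len(sentiment))      # still 3
print(sentiment['dog'])    # 6
```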

Application: Creating a Dataframe

You can create a Dataframe using a Dictionary. Each key represents a column name, and each value is a List containing that column’s values:

import pandas as pd

simple_df = pd.DataFrame(data={'id': ["AAA", "BBB", "CCC", "DDD", "EEE"],
                               'case_control': ["case", "case", "control", "control", "control"],
                               'measurement1': [2.5, 3.5, 9, .1, 2.2],
                               'measurement2': [0, 0, .5, .24, .003],
                               'measurement3': [80, 2, 1, 1, 2]})
                               
simple_df
    id case_control  measurement1  measurement2  measurement3
0  AAA         case           2.5         0.000            80
1  BBB         case           3.5         0.000             2
2  CCC      control           9.0         0.500             1
3  DDD      control           0.1         0.240             1
4  EEE      control           2.2         0.003             2

Application: Data Recoding

Suppose you want to take the “case_control” column of simple_df and change “case” to “experiment” and “control” to “baseline”.

This correspondence can be stored in a dictionary and passed to the .replace() method for Series:

simple_df.case_control.replace({"case": "experiment", "control": "baseline"})
0    experiment
1    experiment
2      baseline
3      baseline
4      baseline
Name: case_control, dtype: object
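Note that .replace() returns a new Series and leaves simple_df unchanged. To keep the recoded values, one option is to assign the result back to the column, as sketched here on a smaller stand-in Dataframe (small_df is made up for this example):

```python
import pandas as pd

small_df = pd.DataFrame({"id": ["AAA", "BBB", "CCC"],
                         "case_control": ["case", "case", "control"]})

recode = {"case": "experiment", "control": "baseline"}

# Without assignment, the original column is untouched.
small_df["case_control"].replace(recode)
before = list(small_df["case_control"])
print(before)  # ['case', 'case', 'control']

# Assign the result back to keep the change.
small_df["case_control"] = small_df["case_control"].replace(recode)
after = list(small_df["case_control"])
print(after)   # ['experiment', 'experiment', 'baseline']
```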

You can do something similar to recode the column names of a Dataframe via the .rename() method.
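For instance, here is a minimal sketch of renaming columns with a dictionary passed to the columns argument of .rename(). The Dataframe and the new column names are invented for illustration.

```python
import pandas as pd

small_df = pd.DataFrame({"id": ["AAA", "BBB"],
                         "measurement1": [2.5, 3.5]})

# Keys are the old column names, values are the new ones.
renamed = small_df.rename(columns={"id": "sample_id",
                                   "measurement1": "weight"})

print(list(renamed.columns))   # ['sample_id', 'weight']
print(list(small_df.columns))  # ['id', 'measurement1'] (original unchanged)
```

Like .replace(), .rename() returns a new Dataframe, so assign the result to a variable if you want to keep it.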

That’s It for Today

Weekly Checkin (use for any questions you might have, pacing, etc.): https://forms.gle/FUsPhbs6Nu2eGbCD6